Coding style: 1 point
library(tidyverse)
OIS v3 5.44: Teaching descriptive statistics.
A study compared five different methods for teaching descriptive statistics.
What are the hypotheses for evaluating whether the average test scores differ across teaching methods?
What are the degrees of freedom associated with the F-test?
Suppose the p-value for this test is 0.0168.
ANSWER:
The hypotheses for evaluating if the average test scores are different for the different teaching methods:
Null Hypothesis (\(H_0\)): all means across all 5 methods are the same (\(\mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5\))
Alternative Hypothesis (\(H_a\)): at least one of the means differs from the others
The degrees of freedom associated with the F-test for evaluating these hypotheses:
Degrees of freedom for treatment: 5 - 1 = 4
Degrees of freedom for error: 45-5 = 40
The conclusion for this test where the p-value = 0.0168:
Assuming a standard significance level of \(\alpha\) = 0.05, reject the null hypothesis because the p-value < \(\alpha\) (0.0168 < 0.05). We conclude that at least one teaching method produces a different average test score.
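The degrees of freedom and the p-value fit together as follows; a minimal base-R sketch, assuming k = 5 groups and n = 45 total observations (n implied by the error degrees of freedom above):

```r
# ANOVA degrees of freedom for k = 5 groups and n = 45 observations
k <- 5
n <- 45
df_treatment <- k - 1   # 4
df_error     <- n - k   # 40

# Recover the F statistic implied by p = 0.0168, then check it round-trips
f_stat <- qf(0.0168, df_treatment, df_error, lower.tail = FALSE)
p_back <- pf(f_stat, df_treatment, df_error, lower.tail = FALSE)
```

Since 0.0168 < 0.05, the recovered F statistic necessarily exceeds the 0.05 critical value qf(0.95, 4, 40).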
OIS v3 7.12: Trees.
This dataset
The diameter of the tree is measured
library(tidyverse)
library(ggplot2)
data(trees)
summary(trees)
Girth Height Volume
Min. : 8.30 Min. :63 Min. :10.20
1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
Median :12.90 Median :76 Median :24.20
Mean :13.25 Mean :76 Mean :30.17
3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
Max. :20.60 Max. :87 Max. :77.00
Visualize the relationships in the data
ggplot(trees, aes(x = Volume, y = Girth)) + ggtitle("Diameter by Volume") + geom_point() + geom_smooth()
ggplot(trees, aes(x = Height, y = Girth)) + ggtitle("Diameter by Height") + geom_point() + geom_smooth()
Let’s answer questions using linear regression
Volume and Diameter (Girth): the geom_smooth() fit shows a positive, roughly linear association between volume and diameter.
Volume and Height: the geom_smooth() fit shows a positive but noisier linear association between volume and height.
Summarizing the model results from the lm() function
girth_lm <- lm(Volume ~ Girth, trees)
height_lm <- lm(Volume ~ Height, trees)
girth_lm
Call:
lm(formula = Volume ~ Girth, data = trees)
Coefficients:
(Intercept) Girth
-36.943 5.066
height_lm
Call:
lm(formula = Volume ~ Height, data = trees)
Coefficients:
(Intercept) Height
-87.124 1.543
Volume and Diameter (Girth): using the lm function, we can see that the girth slope coefficient is positive (+5.066), meaning positive association between girth and volume
Volume and Height: using the lm function, we can see that the height slope coefficient is positive (+1.543), meaning positive association between height and volume
Diameter appears to have a stronger positive correlation with volume than height does.
Suppose you have height and diameter measurements
Which of these variables would be preferable to use
Explain your reasoning
As noted above, the diameter plot shows a tighter fit around the smoothed line (lower spread), and the lm() outputs point the same way, so diameter is the preferable predictor of timber volume in a linear regression model.
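The "stronger predictor" claim can be quantified with \(R^2\); a quick sketch using the built-in trees data:

```r
# Compare how well Girth vs Height explain Volume via R-squared (trees is built in)
data(trees)
r2_girth  <- summary(lm(Volume ~ Girth,  data = trees))$r.squared   # about 0.94
r2_height <- summary(lm(Volume ~ Height, data = trees))$r.squared   # about 0.36
```

Girth explains far more of the variance in Volume than Height does, which supports preferring diameter as the predictor.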
OIS v3 8.16 Challenger disaster, Part I.
On January 28, 1986, a routine launch was anticipated for the Challenger space shuttle.
Seventy-three seconds into the flight, disaster happened.
An investigation into the cause of the disaster focused on a critical seal called an O-ring, and it is believed that damage to these O-rings during a shuttle launch may be related to the ambient temperature during the launch.
The orings.txt file in the data subfolder
Temp gives the temperature in Fahrenheit; Damaged represents the number of damaged O-rings.
o_rings <- read.table('data/orings.txt', header = TRUE)
Visualize the data. What relationships do you observe between temperature and failure?
ggplot(o_rings, aes(x = temp, y = damage)) + ggtitle("damage by temp") + geom_point() + geom_smooth()
The plot shows a decaying relationship between temperature and damage: damage counts drop toward zero as temperature rises.
Create a logistic regression model.
We fit the model with the glm() function:
logistic_challenger <- glm(temp ~ damage, data = o_rings)
summary(logistic_challenger)
Call:
glm(formula = temp ~ damage, data = o_rings)
Deviance Residuals:
Min 1Q Median 3Q Max
-10.3919 -4.4747 0.4426 3.9426 9.4426
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 71.557 1.272 56.248 < 2e-16 ***
damage -4.166 1.096 -3.801 0.00104 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 30.90637)
Null deviance: 1095.65 on 22 degrees of freedom
Residual deviance: 649.03 on 21 degrees of freedom
AIC: 148.09
Number of Fisher Scoring iterations: 2
Based on the model, do you think concerns regarding O-rings are justified? Explain. what does the p-value tell you?
The damage coefficient is -4.166: launches with more damaged O-rings occurred at lower temperatures. The p-value (0.00104 < \(\alpha\) = 0.05) indicates a statistically significant association between O-ring damage and temperature, so concerns regarding the O-rings at low launch temperatures are justified. Note that, since no family argument was given, glm() defaulted to gaussian (visible in the dispersion line above), so this particular fit is a linear model of temperature on damage rather than a true logistic regression.
What assumption has to be made for logistic regression to be valid in this case?
For logistic regression to be valid here, we must assume that the log-odds of O-ring damage are a linear function of temperature and that the outcomes of different launches are independent. From the plot and the model output, these assumptions appear reasonable.
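A properly specified logistic regression would model damage as the response with a binomial family, rather than defaulting to gaussian. A minimal sketch on made-up illustrative values (the real data live in data/orings.txt; the column names temp and damage mirror that file, and six O-rings per launch is an assumption):

```r
# Hypothetical illustration: logistic regression of O-ring damage on temperature.
# Values below are made up; the real measurements are in data/orings.txt.
toy <- data.frame(
  temp   = c(53, 57, 63, 66, 70, 70, 75, 78, 81),
  damage = c(3,  1,  1,  0,  1,  0,  0,  0,  0)
)
# Model the proportion damaged out of 6 O-rings per launch with a binomial family
fit <- glm(cbind(damage, 6 - damage) ~ temp, family = binomial, data = toy)
coef(fit)["temp"]   # negative: damage probability falls as temperature rises
```

With the response and predictor in the conventional orientation, the sign of the temp coefficient directly answers the question of interest.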
Let’s make some logistic models using Palmer’s penguins.
We will use the package nnet.
nnet::multinom() works much like stats::lm(); see ?nnet::multinom for more help. nnet::multinom() is used to build multinomial logistic models.
library(nnet)
library(caret)
glimpse(palmerpenguins::penguins)
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torg…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, 37.8, 41.1, 38.6, 34.6, 36.6…
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17.1, 17.3, 17.6, 21.2, 21.1, 17.8…
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180, 182, 191, 198, 185, 195, 197, 184…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, 3300, 3700, 3200, 3800, 4400, 3700…
$ sex <fct> male, female, female, NA, female, male, female, male, NA, NA, NA, NA, female, male, male, fe…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 20…
df_penguins <- palmerpenguins::penguins
Processing the data
caret::createDataPartition()
# perform an 80/20 split on penguins data
# goal: determine species of penguins
df_penguins$species <- as.factor(df_penguins$species)
partition <- createDataPartition(df_penguins$species,
p = 0.8,
list = FALSE)
penguins_train <- df_penguins[partition, ]
penguins_test <- df_penguins[-partition, ]
test_species <- penguins_test$species
penguins_test <- penguins_test[ , -1]
glimpse(penguins_train)
Rows: 277
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torg…
$ bill_length_mm <dbl> 39.5, 40.3, 36.7, 39.2, 34.1, 42.0, 37.8, 37.8, 41.1, 38.6, 34.6, 36.6, 42.5, 46.0, 37.8, 37…
$ bill_depth_mm <dbl> 17.4, 18.0, 19.3, 19.6, 18.1, 20.2, 17.1, 17.3, 17.6, 21.2, 21.1, 17.8, 20.7, 21.5, 18.3, 18…
$ flipper_length_mm <int> 186, 195, 193, 195, 193, 190, 186, 180, 182, 191, 198, 185, 197, 194, 174, 180, 189, 185, 18…
$ body_mass_g <int> 3800, 3250, 3450, 4675, 3475, 4250, 3300, 3700, 3200, 3800, 4400, 3700, 4500, 4200, 3400, 36…
$ sex <fct> female, female, female, male, NA, NA, NA, NA, female, male, male, female, male, male, female…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 20…
glimpse(penguins_test)
Rows: 67
Columns: 7
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Biscoe, Dream, Dream, Drea…
$ bill_length_mm <dbl> 39.1, NA, 39.3, 38.9, 38.7, 34.4, 38.8, 39.5, 38.8, 40.8, 44.1, 39.6, 37.5, 35.7, 37.6, 45.8…
$ bill_depth_mm <dbl> 18.7, NA, 20.6, 17.8, 19.0, 18.4, 17.2, 17.8, 20.0, 18.4, 19.7, 18.8, 18.9, 16.9, 17.0, 18.9…
$ flipper_length_mm <int> 181, NA, 190, 181, 195, 184, 180, 188, 190, 195, 196, 190, 179, 185, 185, 197, 189, 190, 186…
$ body_mass_g <int> 3750, NA, 3650, 3625, 3450, 3325, 3800, 3300, 3950, 3900, 4400, 4600, 2975, 3150, 3600, 4150…
$ sex <fct> male, NA, male, female, female, female, male, female, male, male, male, male, NA, female, fe…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2008, 2008, 20…
Build a logistic model
model_function_call(Response_Variable ~ .) will automatically use the rest of the columns as predictors.
Once you have obtained your model object, generate predictions with stats::predict().
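The fit-then-predict workflow above can be sketched in miniature with base R alone (using lm() on the built-in mtcars data rather than the penguin model):

```r
# Minimal predict() workflow on built-in data: fit on one subset, predict on the rest
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]
fit   <- lm(mpg ~ wt + hp, data = train)
preds <- predict(fit, newdata = test)   # one prediction per test row
length(preds)   # 8
```

The multinom() model below follows the same pattern: fit on penguins_train, then call predict() with penguins_test.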
set.seed(69)
# Build your model
penguins_model <- multinom(species ~ ., penguins_train, model = FALSE)
# weights: 30 (18 variable)
initial value 296.625318
iter 10 value 37.637790
iter 20 value 1.537253
iter 30 value 0.003887
final value 0.000061
converged
cat("\nmodel summary: \n")
model summary:
summary(penguins_model)
Call:
multinom(formula = species ~ ., data = penguins_train, model = FALSE)
Coefficients:
(Intercept) islandDream islandTorgersen bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sexmale
Chinstrap 0.1004982 82.93313 27.57785 16.60628 -23.30336 0.4062036 -0.03206725 -22.61880
Gentoo 0.2181655 -10.94114 -44.44507 11.09083 -16.05708 -0.3629314 0.05432559 -33.56594
year
Chinstrap -0.1651320
Gentoo -0.1776766
Std. Errors:
(Intercept) islandDream islandTorgersen bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sexmale
Chinstrap 0.006426564 0.2000826 NaN 1.7912758 1.1205803 39.210957 2.236183 0.1697706
Gentoo 0.030055185 0.1696553 1.554182e-18 0.2950739 0.7760304 5.910384 25.562279 0.1697717
year
Chinstrap 7.18726
Gentoo 60.35069
Residual Deviance: 0.000122519
AIC: 36.00012
# Predict classes
predictions <- predict(penguins_model, penguins_test)
# predictions vs actual
cat("\npredicted: \n")
predicted:
predictions
[1] Adelie <NA> Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
[12] Adelie <NA> Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
[23] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Gentoo Gentoo Gentoo
[34] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
[45] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo <NA> Gentoo <NA> Chinstrap
[56] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[67] Chinstrap
Levels: Adelie Chinstrap Gentoo
cat("\nactual: \n")
actual:
test_species
[1] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
[12] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
[23] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Gentoo Gentoo Gentoo
[34] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo
[45] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Chinstrap
[56] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[67] Chinstrap
Levels: Adelie Chinstrap Gentoo
ANSWER: Looking at the output of the multinom() model (which nnet fits as a single-layer neural network with a softmax output), we can see the training metadata, the convergence trace, and the inputs that can be modified (including the weights). A seed is set to ensure the same results across reruns.
Evaluate the accuracy of your model against test data
caret::confusionMatrix()
# comparison of predictions vs actual
confusionMatrix(predictions, reference = test_species)
Confusion Matrix and Statistics
Reference
Prediction Adelie Chinstrap Gentoo
Adelie 28 0 0
Chinstrap 0 13 0
Gentoo 0 0 22
Overall Statistics
Accuracy : 1
95% CI : (0.9431, 1)
No Information Rate : 0.4444
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Adelie Class: Chinstrap Class: Gentoo
Sensitivity 1.0000 1.0000 1.0000
Specificity 1.0000 1.0000 1.0000
Pos Pred Value 1.0000 1.0000 1.0000
Neg Pred Value 1.0000 1.0000 1.0000
Prevalence 0.4444 0.2063 0.3492
Detection Rate 0.4444 0.2063 0.3492
Detection Prevalence 0.4444 0.2063 0.3492
Balanced Accuracy 1.0000 1.0000 1.0000
Are there any things that were commonly misclassified?
ANSWER:
The model predicted every test penguin it could score correctly: the confusion matrix shows 63 correct predictions out of 63 (accuracy 1.0, with sensitivity and specificity of 1.0 for all three classes), so nothing was commonly misclassified. The four test rows that received <NA> predictions had missing predictor values and were excluded from the confusion matrix (28 + 13 + 22 = 63 of the 67 test rows). Handling those missing values, by dropping or imputing incomplete rows, would be the main remaining improvement.
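The <NA> predictions above are what happens when test rows contain missing predictor values. A tiny base-R sketch (toy data, not the penguins) of the mechanism and one way to avoid it:

```r
# NA predictors produce NA predictions; complete.cases() drops incomplete rows first
df <- data.frame(y = c(1, 2, 3, 4),
                 x = c(10, NA, 30, 40))
fit   <- lm(y ~ x, data = df)         # lm silently drops the NA row when fitting
preds <- predict(fit, newdata = df)   # row 2 predicts NA because x is NA
clean <- df[complete.cases(df), ]     # 3 complete rows remain
```

Filtering with complete.cases() (or imputing) before the train/test split would keep predict() and confusionMatrix() working on every row.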
We will be working with the Houston crime data file provided by the
ggmap package, ggmap::crimes.
This CSV file contains the location (latitude and longitude) for crimes reported from January 2010 - August 2010
library(ggmap)
options(rgdal_show_exportToProj4_warnings = "none")
library(sp)
library(rgdal)
library(leaflet)
library(lubridate)
library(RColorBrewer)
library(classInt)
library(tidyverse)
library(Rfast)
Exploratory Data Analysis (EDA)
Let’s do some exploratory data analysis. We’ll start with the temporal aspect of the crime data: what can we say about when people commit crimes?
What trends do you see looking at different time frames?
What months have particularly high crime rates?
What times of day have increased crime rates?
What days of the week have higher crime rates?
Produce three different visuals, one for each of these trends.
(hint: histograms are helpful for showing distributions)
Which of these trends could you have predicted? Does anything surprise you?
Are there any relationships between types of crime and time of day? Produce a stacked histogram and comment on the results
# Load Data
houston_crime_report <- ggmap::crime
# fields:
# time, date, hour, month, day
# premise, offense, beat, number
# block, street, type, suffix, location, address
# lon, lat
# copy:
crime_report <- houston_crime_report
# Crime per Day
ggplot(crime_report, aes(x = day, fill = as.factor(offense))
) + geom_histogram(stat = "count"
) + ggtitle("crimes per day"
) + xlab("day"
)
# Lets Do some EDA
# Looking at the types of Crime on a given day
plot_desired_crimes_for_day <- function(desired_day) {
crimes_for_day <- subset(crime_report, day == desired_day)
crimes_tit <- paste("types of crimes for", desired_day, sep = " ")
ggplot(crimes_for_day, aes(x = offense)
) + geom_histogram(stat = "count"
) + ggtitle(crimes_tit
) + xlab("offense")
}
plot_desired_crimes_for_day("monday")
plot_desired_crimes_for_day("tuesday")
plot_correlation_for_day <- function(desired_day) {
crimes_for_day <- subset(crime_report, day == desired_day)
crimes_tit <- paste("types of crimes for",
desired_day,
"by hour",
sep = " ")
ggplot(crimes_for_day, aes(x = hour, fill = as.factor(offense))
) + geom_histogram(stat = "count", position = "stack"
) + ggtitle(crimes_tit
) + xlab("offense")
}
plot_correlation_for_day("monday")
plot_correlation_for_day("tuesday")
# Now lets look at the Month
ggplot(crime_report, aes(x = month, fill = as.factor(offense))
) + geom_histogram(stat = "count"
) + ggtitle("crimes per month"
) + xlab("month")
plot_desired_crimes_for_month <- function(desired_month) {
crimes_for_month <- subset(crime_report, month == desired_month)
crimes_tit <- paste("types of crimes for", desired_month, sep = " ")
ggplot(crimes_for_month, aes(x = offense)
) + geom_histogram(stat = "count"
) + ggtitle(crimes_tit
) + xlab("offense")
}
plot_desired_crimes_for_month("january")
plot_desired_crimes_for_month("february")
plot_desired_crimes_for_month("march")
ANSWER:
What we can say about when people commit crimes:
Most months and days follow a similar pattern.
Theft is the most common offense, with burglary next.
May has the highest crime rate of the months shown.
Crime rates increase through the afternoon into the evening.
Thursday, Friday, and Saturday generally have higher crime rates than other days.
I couldn’t have confidently predicted these specific patterns, but none of them are surprising.
The stacked histogram suggests the mix of offense types stays broadly similar across times of day, with theft dominating at all hours.
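The counts behind histograms like these can also be read off directly with table(). A toy sketch (made-up rows standing in for ggmap::crime, with the same hour and offense columns):

```r
# Toy tabulation of the kind behind the histograms above (ggmap::crime not required)
toy_crime <- data.frame(
  hour    = c(1, 14, 18, 18, 22, 22, 22, 23),
  offense = c("theft", "theft", "burglary", "theft",
              "theft", "robbery", "theft", "burglary")
)
by_hour    <- table(toy_crime$hour)               # counts per hour
by_offense <- sort(table(toy_crime$offense), decreasing = TRUE)
names(by_offense)[1]   # "theft" is the most common offense in the toy data
```

Tabulating first makes it easy to sanity-check what the plots show before styling them.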
library(sp)
library(rgdal)
library(maptools)
Geospatial Analysis
Next we’ll look at the spatial distribution of this data.
Use source = "stamen" with ggmap::get_map(), passing a bounding box of the form c(left = , right = , top = , bottom = ).
#Get the bbox
bbox <- c(left = -96,
right = -95,
top = 30.25,
bottom = 29.25)
#Retrieve the map
houstonmap <- get_map(location = bbox,
source = 'stamen')
#Plot
crime_report$offense <- as.factor(crime_report$offense)
ggmap(houstonmap) + geom_point(aes(x = lon, y = lat, color = offense), data = crime_report)
ANSWER: plotted
In the last map, it was a bit tricky
We’re going to now modify the incident occurrence layer
Use the ggplot2::stat_density2d() function instead of ggplot2::geom_point().
#Plot using stat_density2d()
ggmap(houstonmap) + stat_density2d(aes(x = lon,
y = lat,
color = offense),
data = crime_report)
ANSWER: plotted.
What does ..level.. do in the ggplot2::stat_density2d() function call?
ggmap(houstonmap) + stat_density2d(aes(x = lon,
y = lat,
color = offense,
fill = ..level..),
data = crime_report)
ANSWER: ..level.. refers to the density contour level computed by the stat (a column in the data frame the stat builds internally); mapping it to fill shades the contour regions by density.
#Filter
library(broom)
library(ggplot2)
Finally
We will use a new package that assists in geospatial analysis
We will use the rgdal package to read .shp files.
.shp files contain information about regions on a map.
Add Polygons for the specific Neighborhoods
rgdal::readOGR() or maptools::readShapeSpatial()
nbhd_path <- 'data/shp/COH_SUPER_NEIGHBORHOODS.shp'
nbhd_file <- readOGR(nbhd_path)
OGR data source with driver: ESRI Shapefile
Source: "/Users/momo/Documents/GitRem/22f-dsci351-451-mxd601/1-assignments/lab-exercise/LE7/data/shp/COH_SUPER_NEIGHBORHOODS.shp", layer: "COH_SUPER_NEIGHBORHOODS"
with 88 features
It has 14 fields
nbhds <- tidy(nbhd_file)
names(nbhd_file)
[1] "OBJECTID" "PERIMETER" "POLYID" "SNBNAME" "COUNCIL_AC" "RECOGNITIO" "SnbrInfoUR" "WeCan"
[9] "Top10" "CEA_FLAG" "cohgis_COH" "cohgis_C_1" "ShapeSTAre" "ShapeSTLen"
auto_thefts <- subset(crime_report, offense == 'auto theft')
filtered_auto_thefts <- as.data.frame(as_tibble(auto_thefts) %>% select(number, lon, lat, block, premise))
id_name_map <- data.frame(
id = nbhd_file$POLYID,
name = nbhd_file$SNBNAME
)
nbhds <- merge(nbhds, id_name_map, by.x = 'id')
What is the most dangerous Neighborhood for your crime? Where is the safest neighborhood?
Label the map with the neighborhoods
ggplot2::geom_text() has a built-in check_overlap = TRUE option for this.
ggmap(houstonmap
) + geom_polygon(data = nbhds,
mapping = aes(x = long,
y = lat,
group = group,
fill = id),
show.legend = FALSE
) + geom_point(data = auto_thefts,
mapping = aes(x = lon,
y = lat,
),
color = "lightgrey",
show.legend = FALSE
) + geom_text(data = nbhds,
mapping = aes(x = long,
y = lat,
label = name),
show.legend = FALSE,
check_overlap = TRUE,
# vjust = "inward",
# hjust = "inward"
)
ANSWER:
Midtown, Willowbrook, and the Kingwood area appear to be among the most dangerous neighborhoods for auto theft, whereas Fondren Gardens, the Willowbend area, Westbranch, and Central Southwest appear to be among the safest.